Text block segmentation using pyramid structure

Authors

  • Chew Lim Tan
  • Zheng Zhang
Abstract

Text block segmentation is a necessary step in document layout analysis. This paper describes an algorithm, and its implementation, that uses a pyramid structure to segment a document image (e.g. a newspaper page) block by block, where a block is either a title or a paragraph. The pyramid structure is a multi-resolution image representation that is amenable to parallel processing. It also simulates how the human eye sees a document from afar, perceiving only its block structure. The block segmentation can identify titles and can distinguish different paragraphs based on the indentation between them. Our implementation will be used in a news article retrieval project.
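The abstract does not say how each pyramid level is derived from the one below it. As a minimal sketch only, assuming a binary page image (1 = ink, 0 = background) and a hypothetical rule in which a parent cell is set whenever any pixel in its 2x2 block is set, a multi-resolution pyramid could be built as follows. The names `reduce_level` and `build_pyramid` are illustrative and do not come from the paper.

```python
def reduce_level(img):
    """Halve the resolution of a binary image given as a list of rows.

    Assumed rule (not from the paper): a parent cell is 1 if any pixel
    in the corresponding 2x2 block of the finer level is 1.
    """
    h, w = len(img), len(img[0])
    parent = []
    for r in range(0, h, 2):
        row = []
        for c in range(0, w, 2):
            # Clip the block at the image border for odd dimensions.
            block = [img[rr][cc]
                     for rr in range(r, min(r + 2, h))
                     for cc in range(c, min(c + 2, w))]
            row.append(1 if any(block) else 0)
        parent.append(row)
    return parent


def build_pyramid(img, levels):
    """Return [level 0, level 1, ...], where level 0 is the input image."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(reduce_level(pyramid[-1]))
    return pyramid


# Toy 8x8 "page": a sparse title line on row 0, a blank gap on rows 1-3,
# and a sparse paragraph on rows 4-7.
page = [[0] * 8 for _ in range(8)]
for c in range(0, 8, 2):
    page[0][c] = 1
for r in range(4, 8):
    for c in range(1, 8, 2):
        page[r][c] = 1

pyramid = build_pyramid(page, 2)
for row in pyramid[1]:
    print(row)
```

At level 1 the scattered pixels of the title and of the paragraph merge into solid rows separated by an empty row, which is the "view from afar" effect the abstract describes. At level 2 the gap disappears as well, so a real implementation would have to choose the level at which blocks are read off.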

Similar resources

Adaptive Region Growing Color Segmentation for Text Using Irregular Pyramid

This paper presents the result of an adaptive region growing segmentation technique for color document images using an irregular pyramid structure. The emphasis is on the segmentation of textual components for subsequent extraction in document analysis. The segmentation is done in the RGB color space. A simple color distance measurement and a category of color thresholds are derived. The propo...


Hierarchical Text Segmentation from Multi-Scale Lexical Cohesion

This paper presents a novel unsupervised method for hierarchical topic segmentation. Lexical cohesion – the workhorse of unsupervised linear segmentation – is treated as a multi-scale phenomenon, and formalized in a Bayesian setting. Each word token is modeled as a draw from a pyramid of latent topic models, where the structure of the pyramid is constrained to induce a hierarchical segmentation...


The Redundancy Pyramid and its Application to Image Segmentation

Irregular pyramids organise several topological partitions which can be deduced from a partition (called the base of the pyramid) by successive unions of regions. In this paper, we introduce the redundancy pyramid structure. This structure accounts for redundant topological structures present in several topological partitions. We apply it to the problem of segmentation fusion, where we show tha...


An entropy based segmentation algorithm for computer-generated document images

This paper presents an efficient compression-oriented segmentation algorithm for computer-generated document images. In this algorithm, a document image is represented in a block-based multiscale pyramid. Image blocks are then characterized by the entropy values of their intensity histograms, and the entropy distributions are assumed to be Gaussian priors in this work. We will discuss ...


Using Irregular Pyramid for Text Segmentation and Binarization of Gray Scale Image

Compared to binary images, which most text extraction methods work on, gray scale images provide much more information for the extraction task. On the other hand, complications also arise in determining the subject textual content from its background region (i.e. thresholding) before the actual text extraction process can begin. Differing from the usual sequence of processes where document images ...


Publication date: 2001